
Conduct analysis on your client's transaction dataset and identify customer purchasing behaviours to generate insights and provide commercial recommendations.
CLIENT : Category Manager for Chips
OBJECTIVE : Better understand the types of customers who purchase chips and their purchasing behaviour within the region
ACTION : The insights from the following analysis will feed into the supermarket's strategic plan for the chip category for the next half year
Email from your supervisor/manager, Zilinka :
Hi,
Welcome again to the team, we love having new graduates join us! I just wanted to send a quick follow-up from our conversation earlier with a few pointers around the key areas of this task to make sure we set you up for success. Below I have outlined your main tasks along with what we should be looking for in the data for each.
Examine transaction data: look for inconsistencies, missing data across the dataset, outliers, correctly identified category items and numeric data across all tables. If you find any anomalies, make the necessary changes in the dataset and save it; having clean data will help when it comes to your analysis.
Examine customer data: check for similar issues in the customer data and look for nulls. When you are happy, merge the transaction and customer data together so it is ready for the analysis, saving your files along the way.
Data analysis and customer segments: in your analysis, make sure you define the metrics. Look at total sales, drivers of sales, where the highest sales are coming from, and so on. Explore the data, create charts and graphs, and note any interesting trends and/or insights you find. These will all form part of our report to Julia.
Deep dive into customer segments: define your recommendation from your insights, determine which segments we should be targeting and whether pack sizes are relevant, and form an overall conclusion based on your analysis.
Make sure you save your analysis in the CSV files along with your visualisations; we will need them for our report. If you could work on this analysis and send me your initial findings by the end of next week, that would be great. Looking forward to reviewing your work.
Thanks, Zilinka
TASK : Present a strategic recommendation to Julia, supported by data, that she can use for the upcoming category review. To do so, we first need to analyse the data to understand current purchasing trends and behaviours. The client is particularly interested in customer segments and their chip purchasing behaviour. Consider what metrics would help describe the customers' purchasing behaviour.
OUR APPROACH :
Importing Libraries
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import plotly.express as px
Reading Datasets
raw_pur = pd.read_csv('QVI_purchase_behaviour.csv')
raw_tra = pd.read_excel('QVI_transaction_data.xlsx')
Now we will focus only on the purchase data and perform univariate analysis.
print(raw_pur.shape)
raw_pur.head(3)
LYLTY_CARD_NBR : Unique identifier of each customer
LIFESTAGE : Social and demographic life stage of a customer. This attribute identifies whether a customer has a family or not and what point in life they are at e.g. are their children in pre-school/primary/secondary school.
PREMIUM_CUSTOMER : Customer segmentation used to differentiate shoppers by the price point of products they buy and the types of products they buy. It is used to identify whether customers may spend more for quality or brand or whether they will purchase the cheapest options.
Check for null values and types of columns we have
raw_pur.info()
raw_pur.isna().sum()
We did not find any null values, but since the LIFESTAGE feature is of object type, we also check for values that contain only a space
raw_pur[raw_pur.LIFESTAGE == ' ']
Check for entries in PREMIUM_CUSTOMER with space
raw_pur[raw_pur.PREMIUM_CUSTOMER == ' ']
Check for duplicate entries
raw_pur[raw_pur.duplicated()]
Check for entries with a duplicate Loyalty Card Number, i.e. the same customer
raw_pur[raw_pur.duplicated(['LYLTY_CARD_NBR'])]
We check how many customers there are per lifestage in our dataset.
We find that retirees, older singles/couples and young singles/couples are the most numerous in our sample, while new families and midage singles/couples have the fewest records.
print(raw_pur.LIFESTAGE.value_counts())
plt.figure(figsize=(12,5))
lifestage_counts = raw_pur.LIFESTAGE.value_counts()
plt.plot(lifestage_counts.index, lifestage_counts.values)
plt.title('No. of Customers based on Life Stage')
plt.xticks(fontsize=10, rotation=30)
plt.xlabel('LIFESTAGE Classes -->')
plt.ylabel('No. of samples -->')
plt.grid()
plt.show()
Now we check the distribution of samples across customer segments.
We observe the most samples in the Mainstream segment and the fewest in Premium.
print(raw_pur.PREMIUM_CUSTOMER.value_counts())
ax = raw_pur.groupby("PREMIUM_CUSTOMER")['LYLTY_CARD_NBR'].count().plot.bar()
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.005, p.get_height() * 1.005), xytext=(14, 1), textcoords='offset points')
No. of customers with each lifestage attribute in each segment
In the Mainstream segment the largest group is YOUNG SINGLES/COUPLES with 8088 customers, followed by 6479 RETIREES and 4930 OLDER SINGLES/COUPLES.
In the Budget segment, which has the next highest number of samples after Mainstream, OLDER SINGLES/COUPLES lead with 4929 customers, closely followed by 4675 OLDER FAMILIES and 4454 RETIREES.
In the Premium segment we again see the most samples from OLDER SINGLES/COUPLES with 4750, followed by RETIREES and YOUNG SINGLES/COUPLES with 3872 and 2574 respectively.
fig = px.sunburst(raw_pur, path=['PREMIUM_CUSTOMER','LIFESTAGE'])
fig.show()
Now we will explore transactional data before combining both dataframes.
print(raw_tra.shape)
raw_tra.head()
DATE : Date of the transaction, stored as an Excel serial number
STORE_NBR : Store number at which the corresponding transaction was made
LYLTY_CARD_NBR : Unique identifier of each customer
TXN_ID : Transaction ID, a number identifying each transaction; multiple products share the same transaction ID if bought together
PROD_NBR : Product number
PROD_NAME : Name of the product; it also contains the brand name, variety and weight of the product
PROD_QTY : Number of units of the same product bought
TOT_SALES : Total dollar value of the transaction
Check for null values and types of columns we have
raw_tra.info()
raw_tra.isna().sum()
We check high-level statistics of the data: mean, standard deviation, quantiles, minimum and maximum.
raw_tra.describe(include='all')
Now we will explore each feature separately through univariate analysis
PROD_NAME
We observe that many products in our dataset are not chips. Since we are focusing only on chips, we remove all non-chip items such as salsa, cheese boxes and burger rings from the dataset.
to_remove = list()
for i in raw_tra.PROD_NAME:
    if 'Salsa' in i: to_remove.append(i)
len(to_remove)
to_remove.extend(['Cheezels Cheese Box 125g',
'Burger Rings 220g',
'Twisties Cheese 270g',
'Twisties Cheese Burger 250g',
'Twisties Chicken270g'])
raw_tra = raw_tra[~raw_tra['PROD_NAME'].isin(to_remove)]
raw_tra.shape
After removing all unwanted samples we are left with 252364 samples.
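Hand-listing non-chip items is fragile if new products appear. As an alternative sketch, a case-insensitive substring filter can catch every salsa variant at once; the product names below are illustrative, not from the dataset:

```python
import pandas as pd

# Illustrative product names; the real filter would run on raw_tra.PROD_NAME.
names = pd.Series([
    'Smiths Crinkle Cut  Chips Chicken 170g',
    'Old El Paso Salsa   Dip Tomato Mild 300g',
    'Doritos Salsa       Medium 300g',
    'Kettle Sensations   Siracha Lime 150g',
])

# Keep rows whose product name does not mention salsa, regardless of case.
is_chip = ~names.str.contains('salsa', case=False)
kept = names[is_chip]
print(len(kept))  # 2
```

The same boolean mask could then be applied to the full frame, with the remaining one-off non-chip items removed by an explicit list as above.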
Product name also has brand name and product weight or pack size which can be used later in multivariate analysis, so we extract these features from PROD_NAME column.
Adding PACK_SIZE feature
raw_tra['PACK_SIZE'] = raw_tra.PROD_NAME.map( lambda x: int(re.search(r'(\d+)\s*[gG]', x).group(1)) )
Checking max and minimum pack sizes for sanity check
max(raw_tra.PACK_SIZE),min(raw_tra.PACK_SIZE)
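Beyond eyeballing the extremes, a range check can flag sizes that look implausible (for example, when a product name contains more than one number and the digits get fused). The 20–500 g bounds below are assumptions for illustration, not values taken from the data:

```python
import pandas as pd

# Illustrative extracted sizes; 3175 mimics two numbers fused together.
sizes = pd.Series([70, 175, 330, 380, 3175])

# Sizes outside a plausible snack-pack range usually mean the extraction
# picked up extra digits and the product name should be inspected.
suspicious = sizes[(sizes < 20) | (sizes > 500)]
print(suspicious.tolist())  # [3175]
```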
Now we check for number of transactions based on pack size
pack_size_trans = raw_tra.PACK_SIZE.value_counts()
plt.figure(figsize=(14,5))
ax = pack_size_trans.plot.bar(stacked=True)
plt.xticks(range(len(pack_size_trans)), pack_size_trans.index, rotation=0)
plt.xlabel('Pack Sizes (in grams) -->')
plt.ylabel('No. of transactions -->')
plt.title('Pack Size wise transactions Plot')
plt.grid()
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.005, p.get_height() * 1.005), xytext=(-5, 1), textcoords='offset points')
plt.show()
Adding BRAND_NAME feature
Before adding brand name we clean incorrectly spelled brand names
raw_tra.PROD_NAME = raw_tra.PROD_NAME.map(lambda x: re.sub('Snbts','Sunbites',x))
raw_tra.PROD_NAME = raw_tra.PROD_NAME.map(lambda x: re.sub('Dorito Corn','Doritos Corn',x))
raw_tra.PROD_NAME = raw_tra.PROD_NAME.map(lambda x: re.sub('Infzns','Infuzions',x))
raw_tra.PROD_NAME = raw_tra.PROD_NAME.map(lambda x: re.sub('Smith Crinkle','Smiths Crinkle',x))
raw_tra.PROD_NAME = raw_tra.PROD_NAME.map(lambda x: re.sub('GrnWves','Grain Waves',x))
We take the first word of the PROD_NAME feature as the brand name. For brands whose names span more than one word, we replace the truncated token with the full name and expand abbreviated brand names.
raw_tra['BRAND_NAME'] = raw_tra.PROD_NAME.map(lambda x: x.split()[0])
raw_tra.BRAND_NAME = raw_tra.BRAND_NAME.replace('Old', 'Old El Paso')
raw_tra.BRAND_NAME = raw_tra.BRAND_NAME.replace('Grain', 'Grain Waves')
raw_tra.BRAND_NAME = raw_tra.BRAND_NAME.replace('Red', 'Red Rock Deli')
raw_tra.BRAND_NAME = raw_tra.BRAND_NAME.replace('RRD', 'Red Rock Deli')
raw_tra.BRAND_NAME = raw_tra.BRAND_NAME.replace('French', 'French Fries')
raw_tra.BRAND_NAME = raw_tra.BRAND_NAME.replace('Natural', 'Natural Chip Co')
raw_tra.BRAND_NAME = raw_tra.BRAND_NAME.replace('NCC', 'Natural Chip Co')
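The chained replace() calls can also be collapsed into a single mapping dict, which is easier to extend when new abbreviations turn up. A minimal sketch on illustrative brand tokens:

```python
import pandas as pd

# Illustrative first-word brand tokens, mirroring BRAND_NAME.
brands = pd.Series(['RRD', 'Red', 'Smiths', 'NCC', 'Grain'])

# One dict of token -> canonical brand; values not in the dict pass through unchanged.
canonical = {
    'RRD': 'Red Rock Deli', 'Red': 'Red Rock Deli',
    'NCC': 'Natural Chip Co', 'Grain': 'Grain Waves',
}
brands = brands.replace(canonical)
print(brands.tolist())
```

Keeping all the aliases in one place also makes it easy to verify that every abbreviation maps to exactly one canonical brand.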
Most popular brands by number of transactions
brands_trans = raw_tra.BRAND_NAME.value_counts()
plt.figure(figsize=(14,5))
ax = brands_trans.plot.bar(stacked=True)
plt.xticks(range(len(brands_trans)), brands_trans.index, rotation=90)
plt.xlabel('Brands -->')
plt.ylabel('No. of transactions -->')
plt.title('Brand wise transactions Plot')
plt.grid()
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.005, p.get_height() * 1.005), xytext=(-5, 1), textcoords='offset points')
plt.show()
PROD_QTY
In what quantities are people buying chips?
print(raw_tra.PROD_QTY.value_counts())
plt.figure(figsize=(8,8))
ax = raw_tra.groupby("PROD_QTY")['PROD_QTY'].count().plot.bar()
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.005, p.get_height() * 1.005), xytext=(0, 1), textcoords='offset points')
plt.title('Number of Products bought')
plt.ylabel('Number of Customers -->')
plt.xlabel('Product Quantity Bought -->')
plt.grid()
plt.show()
It seems people mostly buy chips in quantities of 2, with some buying 1, 3, 4 or even 5.
We see there are two transactions with quantity as 200. We check below what has been purchased in this large quantity.
raw_tra[raw_tra.PROD_QTY == 200]
One customer with loyalty card number 226000 bought 200 Doritos chip packets on 19-August-2018, and the same item again, from the same store and in the same quantity, on 20-May-2019.
Now we examine other transactions by the same customer to identify a pattern, if there is one.
raw_tra[raw_tra.LYLTY_CARD_NBR == 226000]
Since we did not find any other transactions from this customer, we can assume this is not a regular retail customer. We remove these rows because we are only analysing the retail behaviour of consumers right now, and these two observations may skew our statistical parameters.
raw_tra.drop(raw_tra[raw_tra.PROD_QTY == 200].index.values, inplace=True)
raw_tra.shape
TOT_SALES
We check the inter-quartile range of the total sales feature through a box plot
total_sales = sorted(raw_tra.TOT_SALES)
plt.boxplot(total_sales, showmeans = True)
plt.grid()
plt.title('Feature TOT_SALES Values Box Plot')
plt.show()
min(total_sales), max(total_sales)
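The box plot's whiskers correspond to Tukey's fences. A sketch of computing them explicitly, on a small illustrative sample of sales values (not taken from the dataset):

```python
import numpy as np

# Illustrative transaction totals; the same logic would run on raw_tra.TOT_SALES.
sales = np.array([1.9, 3.6, 4.6, 5.4, 6.0, 7.2, 8.8, 29.5])

# Tukey fences: points beyond 1.5 * IQR from the quartiles are flagged as outliers.
q1, q3 = np.percentile(sales, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = sales[(sales < low) | (sales > high)]
print(outliers)  # [29.5]
```

Flagging is not the same as deleting: as with the 200-packet customer above, each flagged point deserves a look before removal.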
PROD_NBR
Product number is the unique number given to each product. Here we find out the most popular and least popular products.
product_numbers = dict(zip(raw_tra.groupby('PROD_NBR')['PROD_NBR'].count().index, raw_tra.groupby('PROD_NBR')['PROD_QTY'].sum()))
product_numbers = {k: v for k, v in sorted(product_numbers.items(), key=lambda item: item[1], reverse=True)}
plt.figure(figsize=(18,4),dpi=300)
plt.plot(list(product_numbers.values()))
plt.grid()
plt.xlabel('Product Number -->')
plt.xticks(range(len(product_numbers)), product_numbers.keys(), fontsize=8, rotation=90)
plt.ylabel('Times Bought -->')
plt.title('Feature PROD_NBR Values Plot')
plt.show()
Most Popular Items with number of quantities bought
threshold = 6000
for pro, qunty in product_numbers.items():
    if qunty > threshold: print(raw_tra[raw_tra.PROD_NBR == pro]['PROD_NAME'].iloc[0] + ' : ' + str(qunty))
Least popular items
threshold = 2800
for pro, qunty in product_numbers.items():
    if qunty < threshold: print(raw_tra[raw_tra.PROD_NBR == pro]['PROD_NAME'].iloc[0] + ' : ' + str(qunty))
TXN_ID
Unique ID for every transaction. Two different products will have the same TXN_ID if they were bought together in one basket.
raw_tra.TXN_ID.isna().sum()
We check how many different items can be bought under one TXN_ID; the maximum is 3.
raw_tra.TXN_ID.value_counts().unique()
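The same value_counts can be read as basket sizes, i.e. items per transaction. A sketch on a toy series of transaction IDs (the values are made up):

```python
import pandas as pd

# Toy TXN_IDs: one row per line item, repeated IDs mean items bought together.
txn = pd.Series([1001, 1001, 1002, 1003, 1003, 1003])

# Count line items per transaction; the max is the largest basket observed.
basket_sizes = txn.value_counts()
print(basket_sizes.max())  # 3
```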
STORE_NBR
Here we examine the stores with the most transactions (by count, not sales value)
plt.figure(figsize=(16,5))
raw_tra.STORE_NBR.value_counts().iloc[:50].plot.bar()
plt.xlabel('Store Number -->')
plt.ylabel('No. of Transactions -->')
plt.title('Top 50 Stores - Transactions Wise')
plt.grid()
plt.show()
Stores that have lowest number of transactions
plt.figure(figsize=(12,5))
raw_tra.STORE_NBR.value_counts().iloc[-35:].plot.bar()
plt.xlabel('Store Number -->')
plt.ylabel('No. of Transactions -->')
plt.title('Bottom 35 Stores - Transactions Wise')
plt.grid()
plt.show()
Stores with the highest total sales
store_sales = dict(raw_tra.groupby('STORE_NBR')['TOT_SALES'].sum())
store_sales = {k:v for k,v in sorted(store_sales.items(), key=lambda x:x[1], reverse=True)}
no_of_top = 50
plt.figure(figsize=(16,5))
pd.Series(store_sales).iloc[:no_of_top].plot.bar()
plt.xlabel('Store Number -->')
plt.ylabel('Sales -->')
plt.title('Top %s Stores with max Sales' % str(no_of_top))
plt.grid()
plt.show()
Which brands and items is our top-selling store 226 selling?
raw_tra[raw_tra.STORE_NBR == 226]['BRAND_NAME'].value_counts().iloc[:15]
raw_tra[raw_tra.STORE_NBR == 226]['PROD_NAME'].value_counts().iloc[:15]
DATE
Dates are given to us in Excel serial format, so we decode them into standard dates and extract the year, month, day of month and day of year as separate features to understand time-based customer behaviour.
def excel_to_date(x):
    return datetime.fromordinal(datetime(1900, 1, 1).toordinal() + x - 2)

raw_tra['YEAR'] = raw_tra.DATE.map(lambda x: excel_to_date(x).year)
raw_tra['MONTH'] = raw_tra.DATE.map(lambda x: excel_to_date(x).month)
raw_tra['M_DAY'] = raw_tra.DATE.map(lambda x: excel_to_date(x).day)
raw_tra['Y_DAY'] = raw_tra.DATE.map(lambda x: excel_to_date(x).timetuple().tm_yday)
raw_tra.head(3)
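pandas can also decode Excel serial dates directly: an origin of 1899-12-30 absorbs both the serial offset and Excel's 1900 leap-year quirk, which is what the manual `- 2` corrects for. A sketch on two illustrative serial values:

```python
import pandas as pd

# Two illustrative Excel serial dates (not taken from the dataset).
serials = pd.Series([43282, 43646])

# Excel's effective day zero is 1899-12-30 once its 1900 leap-year bug is accounted for.
dates = pd.to_datetime(serials, unit='D', origin='1899-12-30')
print(dates.dt.year.tolist(), dates.dt.month.tolist())  # [2018, 2019] [7, 6]
```

With a proper datetime column, year, month and day-of-year fall out of the `.dt` accessor without any manual ordinal arithmetic.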
We check the number of transactions in each year. The split between 2018 and 2019 looks consistent with a July-to-June window.
print(raw_tra.YEAR.value_counts())
raw_tra.YEAR.value_counts().plot.bar()
Our dataset has samples from July 2018 to June 2019, a total of 12 months of data. Below we examine the number of transactions in each month.
We see the highest peak in December 2018, which can be attributed to Christmas, and a sharp, unexplained decline in February 2019. The red line denotes the mean monthly transactions over these twelve months.
average_trans = raw_tra[['MONTH','YEAR']].groupby('MONTH').count().mean().values[0]
month_wise_trans = pd.concat([
    raw_tra[raw_tra.YEAR == 2018][['MONTH','YEAR']].groupby('MONTH').count(),
    raw_tra[raw_tra.YEAR == 2019][['MONTH','YEAR']].groupby('MONTH').count()
])
pd.DataFrame(month_wise_trans.values).rename(columns={0:'July 2018 - June 2019'}).plot()
plt.xticks(range(len(month_wise_trans)), month_wise_trans.index)
plt.axhline(y=average_trans, color='r', linestyle='-', label='Avg. Transactions')
plt.xlabel('Months (2018 to 2019)-->')
plt.ylabel('No. of transactions -->')
plt.title('Month wise transactions plot')
plt.legend()
plt.grid()
plt.show()
Now we will merge purchase data with transactional data to perform further analysis.
df = pd.merge(raw_tra, raw_pur, on='LYLTY_CARD_NBR', how='left')
print(df.shape)
df.head(2)
df.isna().sum()
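Beyond counting nulls, passing indicator=True to merge makes it explicit whether every transaction found a matching customer. A sketch on toy frames whose column names mirror ours but whose data is made up:

```python
import pandas as pd

# Toy stand-ins for the transaction and purchase-behaviour frames.
tra = pd.DataFrame({'LYLTY_CARD_NBR': [1, 2, 3], 'TOT_SALES': [6.0, 3.8, 2.9]})
pur = pd.DataFrame({'LYLTY_CARD_NBR': [1, 2], 'LIFESTAGE': ['RETIREES', 'NEW FAMILIES']})

# indicator=True adds a '_merge' column tagging each row's provenance,
# so transactions without a customer record are easy to count.
merged = pd.merge(tra, pur, on='LYLTY_CARD_NBR', how='left', indicator=True)
unmatched = (merged['_merge'] == 'left_only').sum()
print(unmatched)  # 1
```

A non-zero unmatched count would show up as nulls in LIFESTAGE and PREMIUM_CUSTOMER after the real merge.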
How much is each customer lifestage segment buying?
df.groupby('LIFESTAGE')['TOT_SALES'].sum().plot.bar()
plt.title('Sales based on Customer Life Stage')
plt.xlabel('LIFESTAGE Classes -->')
plt.ylabel('Total Sales ($) -->')
plt.show()
| LIFESTAGE | Count |
|---|---|
| OLDER SINGLES/COUPLES | 51932 |
| RETIREES | 47361 |
| OLDER FAMILIES | 46300 |
| YOUNG FAMILIES | 41630 |
| YOUNG SINGLES/COUPLES | 34627 |
| MIDAGE SINGLES/COUPLES | 23919 |
| NEW FAMILIES | 6593 |
From the table above we know that the number of OLDER SINGLES/COUPLES samples in our dataset is very high, so the figure above may show the highest chip spend for this segment simply because it has the most customers.
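One way to control for unequal segment sizes is to normalise spend per customer. A sketch on a toy frame with the same column names (the data is illustrative, not from the dataset):

```python
import pandas as pd

# Toy merged frame: one row per transaction, columns mirror the real df.
df = pd.DataFrame({
    'LIFESTAGE': ['RETIREES', 'RETIREES', 'NEW FAMILIES'],
    'LYLTY_CARD_NBR': [1, 2, 3],
    'TOT_SALES': [10.0, 6.0, 12.0],
})

# Average spend per customer removes the effect of a segment simply having
# more customers than another.
per_customer = (df.groupby('LIFESTAGE')
                  .agg(total_sales=('TOT_SALES', 'sum'),
                       customers=('LYLTY_CARD_NBR', 'nunique')))
per_customer['avg_per_customer'] = per_customer.total_sales / per_customer.customers
print(per_customer)
```

On the toy data, NEW FAMILIES out-spend RETIREES per customer even though their total is lower, which is exactly the distinction a per-customer metric surfaces.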